Automatic Wrapper System for Semi-Structured Documents Based on Data Mining

Authors

  • Irina RANCEA
  • Valentin SGÂRCIU
Abstract

The world in which we evolve requires understanding and accumulating an immense amount of information spread across different sources that need integration and synthesis. This has created a need for intelligent applications capable of automatically processing or collecting the desired information. Such applications use clustering algorithms to discover groups. At the same time, given the experience gained over time in the field of software applications, the prevailing trend is toward process automation, which saves developers valuable time that can be spent designing new concepts and architectures. The paper proposes combining information discovery in documents with the processing of that information in order to automate software processes.

Similar resources

A Tool for Semi-Automatic Generation and Maintenance of Taxonomies from Semi-Structured Documents

This chapter introduces OntoExtractor, a tool for the semi-automatic generation of the taxonomy from a set of documents or data sources. The tool generates the taxonomy in a bottom-up fashion. Starting from structural analysis of the documents, it produces a set of clusters, which can be refined by a further grouping created by content analysis. Metadata describing the content of each cluster i...

Learning Information Extraction Rules for Web Data Mining

The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Web mining applications, such as comparison shopping, require expensiv...

Automatic Extraction of Information Blocks Using PAT Trees

Information extraction from semi-structured Web documents is a critical issue for software agents on the Internet. Previous work in wrapper induction aims to solve this problem by applying machine learning to automatically generate extractors, but this approach still requires human intervention to provide training examples. In this paper, we present a novel approach that extracts information blo...

Populating Ontologies with Data from OCRed Lists

A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...

A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents

We propose an extensible architecture which allows wrapper-learning systems to be easily constructed and tuned. In this architecture the bias of the wrapper-learning system is encoded as an ordered set of “builders”, each associated with some restricted extraction language L. To implement a new builder it is only necessary to implement a small set of core operations for L. Builders can also be ...

Publication date: 2012